
Cherry pick #3484 onto 1.19: Serve stale on ongoing throttling #3518

Conversation

marwanad (Member) commented:
Cherry-picks: #3484

The k8s Azure clients keep track of previous HTTP 429 responses and their
Retry-After cool-down periods. On subsequent calls, they will notice the
ongoing throttling window and will return a synthetic error (without an
HTTPStatusCode) rather than submitting a throttled request to the ARM API:
https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/vendor/k8s.io/legacy-cloud-providers/azure/clients/vmssvmclient/azure_vmssvmclient.go#L154-L158
https://github.com/kubernetes/autoscaler/blob/a5ed2cc3fe0aabd92c7758e39f1a9c9fe3bd6505/cluster-autoscaler/vendor/k8s.io/legacy-cloud-providers/azure/retry/azure_error.go#L118-L123
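For reference, here is a minimal sketch of that client-side short-circuit, assuming the `retry.GetThrottlingError` helper from the linked azure_error.go (simplified, not the verbatim vendored code):

```go
package sketch

import (
	"time"

	"k8s.io/legacy-cloud-providers/azure/retry"
)

// listVMSSVMs sketches the vendored client's guard: if an earlier HTTP 429
// set a Retry-After deadline that is still in the future, the client skips
// the ARM request entirely and returns a synthetic throttling error. The
// synthetic error is Retriable and carries RetryAfter, but its
// HTTPStatusCode stays at the zero value, since no HTTP response was seen.
func listVMSSVMs(retryAfterReader time.Time) *retry.Error {
	if retryAfterReader.After(time.Now()) {
		return retry.GetThrottlingError("VMSSVMList", "client throttled", retryAfterReader)
	}
	// ... otherwise the real ARM List request would be issued here ...
	return nil
}
```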

Some CA components can cope with a temporarily outdated object view when
throttled. They call into `isAzureRequestsThrottled()` on client errors to
serve stale objects from cache (if any) and extend the object's refresh
period.

But this only works for the first API call (the one actually receiving the
HTTP 429). Subsequent calls in the same throttling window (per the
Retry-After header) won't be identified as throttled by
`isAzureRequestsThrottled`, due to their null `HTTPStatusCode`.
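A simplified sketch of the check as it behaves before this change (field names follow `k8s.io/legacy-cloud-providers/azure/retry`; the exact function body is paraphrased, not quoted):

```go
package sketch

import (
	"net/http"

	"k8s.io/legacy-cloud-providers/azure/retry"
)

// isAzureRequestsThrottled (pre-change, simplified): only recognizes the
// first throttled call, which carries a real HTTP 429 status code.
func isAzureRequestsThrottled(rerr *retry.Error) bool {
	if rerr == nil {
		return false
	}
	// Synthetic errors produced during an ongoing throttling window have
	// HTTPStatusCode == 0, so they fall through this check and get treated
	// as ordinary failures.
	return rerr.HTTPStatusCode == http.StatusTooManyRequests
}
```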

This can make the CA panic during startup due to a failing cache init, when
more than one VMSS call hits throttling. We've seen this cause early
restart loops, with each restart re-scanning every VMSS due to the cold
cache on start, keeping the subscription throttled.

Practically, this change allows the 3 call sites (`scaleSet.Nodes()`,
`scaleSet.getCurSize()`, and `AgentPool.getVirtualMachinesFromCache()`) to
serve from cache (and extend the object's next refresh deadline) as they
would on the first HTTP 429 hit, rather than returning an error; see the
sketch below.
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Sep 16, 2020
marwanad (Member, Author) commented:

/area provider/azure

@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. area/provider/azure Issues or PRs related to azure provider labels Sep 16, 2020
feiskyer (Member) left a comment:


/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 17, 2020
k8s-ci-robot (Contributor) commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: feiskyer

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 17, 2020
@k8s-ci-robot k8s-ci-robot merged commit 1e90d80 into kubernetes:cluster-autoscaler-release-1.19 Sep 17, 2020
@marwanad marwanad deleted the cherry-pick-3484-1.19 branch September 18, 2020 22:44